Abstract

In the future war against terror and new types of offensive
activities away from home, telemedical systems, including a
telesurgical system, may become standard military medical
equipment. In recent years, there have been significant
technological advances in both telecommunications and robotics.
These advances have made remotely operable telemedicine possible.
However, a key technology that rapidly encodes, transmits, and
decodes surgical video with the minimum round-trip delay and the
least influence by network jitter (random fluctuation of delay) is
not currently available. This paper presents a special-purpose
video coding method to support telesurgery, telemonitoring, and
teleconsultation, with special emphasis on telesurgery. Our method
utilizes advanced image processing algorithms which prioritize the
importance of the scene shown on the video screen. This
prioritization is performed according to a gaze map constructed
based on tracking the eye movements of the remote observer. During
network congestion, our system processes video data more
aggressively and transmits non-essential data at reduced data
rates. As a result, the essential information necessary to perform
surgery is protected against network deterioration.

1  Introduction 

It has been a classic problem to provide severely injured soldiers
with time-critical medical care, such as a sophisticated
neurosurgery to stop an ongoing hemorrhage within the brain, within
or near the battlefield. Recently, broadband telecommunication
channels and computer networks have connected the theater of war to
the Continental United States (CONUS), providing near-instantaneous
bi-directional communications. Also, robotic surgical systems (for
example, the da Vinci Surgical System) are being developed and
experience is being gained in their use. These technological
advances in telecommunications make it possible to provide
telemedicine services remotely and, when combined with surgical
robots, to allow an expert surgeon in CONUS to operate on an injured
soldier without being limited by geographical distance or being
exposed to the risk of injury on the battlefield.

Because of the high bandwidth requirement, video transmission plays
the most critical role in the telesurgery information pathway. In
the past twenty years, extensive research on video coding, which
aims at data compression, has produced a number of widely utilized
video coding standards, such as MPEG-2, MPEG-4, and H.264 [1, 2, 3,
4, 5]. These standards have been highly successful in the
entertainment industry, leading to the development of technologies
such as digital video disks (DVDs), high definition video disks
(HDVDs), and video broadcasting. However, interactive video
applications, such as telesurgery, impose additional requirements
on the video coding system, including low delay, high scalability
in video streaming, and short encoding and decoding latencies.
Unfortunately, these requirements are not fully supported by the
existing standards beyond provisions of rudimentary rate-distortion
controls [6, 7].

Among the application-specific requirements, network delay and
jitter have been identified as the key problems affecting the
performance of telesurgery, leading to instability in the control
of remote surgical robots [8, 9, 11]. Two major sources of delay
have been observed in the transmission of surgical video [10]. The
first source is the data processing time required for digitization,
encoding, decoding, and rendering. This type of delay can be
controlled using advanced computational architectures (e.g.,
pipeline or parallel processing) and dedicated hardware such as
FPGA and DSP chips. In general, an increase in the computational
complexity requires additional processing time. However, this
increase in time does not necessarily cause an increase in delay
because digital video rendering can be considered as a series of
discrete events occurring at equally separated frame intervals. For
example, all data processing tasks would produce the same 33.3 ms
delay if the total amount of processing can be completed within
33.3 ms, assuming a frame rate of 30 frames/sec. The second source
of delay is due to the network. This delay can be subdivided into
three categories [12]: propagation, serialization, and queuing. The
propagation delay is simply due to the travel time of the
electrical signal between Point A and Point B (e.g., 66.7 ms
round-trip between Los Angeles and Beijing). Without a significant
routing change, this delay is close to a constant, implying that
the variation in this delay is small. The serialization delay is
the time spent on transmitting a single packet. It depends on the
bandwidth of the network connection and the size of the packet. For
example, a 10 Mbps Ethernet connection transmits a 188-byte packet
in 0.15 ms (assuming 100% efficiency), while it takes a 56K modem
(at a bit rate of 203 Kbps) 16.7 ms for the same packet (with 10
synchronization bits for each byte) [14]. For a fixed packet size
and stable routing, this delay is nearly constant and the
corresponding jitter is small. The last type of delay, queuing
delay, is the time that a packet spends waiting within the router
queues. This delay is determined by the network traffic. Without
congestion this delay is negligible. With heavy congestion this
delay becomes substantial. Therefore, the queuing delay is usually
the most significant delay component in today’s IP-based networks,
and the jitter resulting from the variation in this queuing delay
is dominant. Both the queuing delay and jitter are key factors
affecting the performance of telesurgery [10, 12, 15].
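The delay figures quoted above can be reproduced with a few lines of arithmetic. The following is only an illustrative sketch: the function names are ours, the Los Angeles to Beijing distance of roughly 10,000 km and propagation at the vacuum speed of light are assumptions chosen to match the 66.7 ms figure, and the modem case reads "10 synchronization bits for each byte" as 8 data bits plus 10 overhead bits, which is the reading that reproduces the quoted 16.7 ms.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458  # assumed propagation speed (vacuum)


def frame_interval_ms(frames_per_second):
    """Time between two rendered frames, in milliseconds."""
    return 1000.0 / frames_per_second


def propagation_round_trip_ms(distance_km, speed_km_s=SPEED_OF_LIGHT_KM_S):
    """Round-trip travel time of the signal over a one-way distance."""
    return 2.0 * distance_km / speed_km_s * 1000.0


def serialization_delay_ms(packet_bytes, bits_per_byte, link_bits_per_second):
    """Time needed to clock one packet onto the link."""
    return packet_bytes * bits_per_byte / link_bits_per_second * 1000.0


if __name__ == "__main__":
    print(round(frame_interval_ms(30), 1))                      # 33.3
    print(round(propagation_round_trip_ms(10_000), 1))          # 66.7
    print(round(serialization_delay_ms(188, 8, 10e6), 2))       # 0.15
    print(round(serialization_delay_ms(188, 8 + 10, 203e3), 1)) # 16.7
```

Only the serialization term depends on the amount of data sent, which is why reducing the encoded size of each frame shortens this delay directly.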

From the above analysis, it can be observed that the delay and
jitter cannot be totally eliminated because a minimum time for data
processing and transmission is required and the rates of processing
and transmission are not always constant. There are two obvious
solutions. First, a dedicated connection can bring the queuing
delay close to zero if the bandwidth is high enough. Second, a
quality-of-service (QoS) enabled connection, which assigns a higher
priority to the telesurgical data, would substantially reduce delay
and jitter. However, the availability of these options cannot be
assumed in the theater of war, especially in the cases where an
inter-continental network connection through many countries is
required.

In this work, we present a novel approach that reduces both the
serialization and queuing delays, as well as jitter, from a
different perspective. Using an eye tracking technique, we detect
the visual attention of the surgeon at the remote site and send
this information to the operating site. Since this information
amounts to only a small number of bytes and is usually predictable
from the previously received data, the delay caused by this
transmission is minimal or can be eliminated. In the next step, a
region of interest (ROI), which reflects a real-valued gaze map
indicating the importance of each pixel, is determined. The video
data at the operating site are pre-processed to preserve the
quality of data according to the ROI and the network traffic
condition. In this way, the required bandwidth is dynamically
controlled to adapt to the transmission channel while the critical
information necessary to perform surgery is nearly invariant. Our
experimental results indicate a nearly 50% reduction in bandwidth
without affecting the essential visual fidelity in the field where
surgery is performed. Using this context-based video coding
approach, the number of packets that must be transmitted through
the network can be scaled down significantly, and so can the
likelihood of excessive queuing delay.

2  Methods 

Traditionally, the quality of a video is measured by the accuracy
of pixels in the entire image with respect to the physical scene.
As a result, all pixels in the image are expected to be
reconstructed as close as possible to their original values after
the encoding, transmission, reception, and decoding cycle. This
traditional approach requires large and stable bandwidth and this
bandwidth cannot be reduced substantially by any video codec
available today.

2.1  Basic  Concepts 

In telemedical applications, especially in telesurgery, the sole
purpose of transmitting the video is for the surgeon to view and
manually operate within a surgical landscape. If he/she chooses not
to view a part of the video screen closely, high-quality
transmission of that part becomes unnecessary. By understanding how
a surgeon visually perceives the surgical field, one can present
him/her pre-processed images with high-quality content only within
the area to which he/she is attending. Outside of this region,
lower-quality content is provided. As a result, the surgeon
collects the same amount of visual information necessary to perform
surgery, while the data can be transmitted more rapidly.

2.2  Eye  Gazing

Figure 1: Eccentricity (in degrees) vs. receptor density for cone
and rod visual sensory cells (Osterberg, 1935).

“Gaze” was originally a psychological concept of visual culture
describing how people look at each other in order to effectively
gather interpersonal or social information. In the context of this
technical paper, we use “eye gazing” to describe the action of
attending to a portion of a visual scene for a certain amount of
time. During “gazing”, the eyes project a scene onto the retina.
The gazing point is projected to its center (the fovea). Studies in
visual science have shown that the image projected onto the retina
is not perceived with a uniform resolution. This is because the
cones and rods are not distributed uniformly. The density of the
light-sensitive receptors is the highest at the fovea and decreases
precipitously towards the outer rim, as shown in Fig. 1. The cones,
which are sensitive to color, concentrate at the fovea. This center
of vision covers less than 2 degrees of the visual field, compared
with an entire visual field of 160 degrees in an adult human. The
rods, sensitive to luminance, have their largest density slightly
outside the fovea and monotonically decreasing density toward the
outer border.

The resolution of the image projected on the retina is determined
by the densities of the cone and rod cells. This resolution,
however, is not the same as that of the image perceived by the
visual cortex. The ganglion cells are also distributed densely in
the central region of the retina and coarsely outside. Within the
fovea, the signals collected by the cone cells are integrated by
ganglion cells at a resolution of 0.03 degrees. Outside the fovea,
the rod cells are connected to ganglion cells at a resolution of 3
degrees, a hundred times coarser than within the fovea. Therefore,
detailed vision is only present at the central region of the eye, a
fractional part of the entire visual field.

2.3  Spatial  Acuity  Modeling

Figure 2: Spatial acuity model with parameter k=0.2

We utilize the basic concepts in visual science to construct the
gaze map in the development of the novel adaptive video codec.
Previous psychophysical tests suggest that visual acuity is nearly
halved at one degree from the foveal center and decreased to one
quarter at five degrees from the foveal center [16]. Several acuity
models have been proposed based on empirical studies [17, 18].
Since these models have similar attenuating profiles, we utilize
the following simple model [19]:

$$A(x) = \frac{1}{1 + k\,\theta(x)} \qquad (1)$$

where $A(x)$ and $\theta(x)$ are, respectively, the acuity and
eccentricity angle at pixel location $x$ (see Fig. 2), and $k$ is a
constant. The value of $k$ can be obtained empirically according to
the video acquisition system utilized [20, 21].
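As a minimal sketch of how Eq. (1) could be evaluated over a whole frame, the code below builds an acuity map around a given gaze point. It is not the implementation used in this work: the function name `acuity_map` and the conversion from pixel offsets to eccentricity angle (through the hypothetical parameters `pixels_per_cm` and `viewing_distance_cm`) are assumptions made only for illustration, and the eccentricity is taken in degrees.

```python
import numpy as np


def acuity_map(height, width, gaze_row, gaze_col, k=0.2,
               pixels_per_cm=40.0, viewing_distance_cm=60.0):
    """Evaluate A(x) = 1 / (1 + k * theta(x)) for every pixel x.

    theta(x) is the eccentricity angle (degrees) between the gaze point
    and pixel x, derived from assumed display geometry.
    """
    rows, cols = np.mgrid[0:height, 0:width]
    # Distance of each pixel from the gaze point, measured on the screen.
    offset_cm = np.hypot(rows - gaze_row, cols - gaze_col) / pixels_per_cm
    # Visual angle subtended by that distance at the assumed viewing distance.
    theta_deg = np.degrees(np.arctan2(offset_cm, viewing_distance_cm))
    return 1.0 / (1.0 + k * theta_deg)


if __name__ == "__main__":
    acuity = acuity_map(576, 720, gaze_row=288, gaze_col=360)
    print(acuity[288, 360])           # acuity at the gaze point
    print(acuity.min(), acuity.max())
```

The map equals 1 at the gaze point and decays smoothly with eccentricity, so it can serve directly as the real-valued importance weight of each pixel.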


2.4  Measurement  of  Eye  Movements 

The “gazing” map to be constructed is a time-varying function with
a dynamically changing location of its peak. This location must be
determined by eye tracking. Although we have not constructed an eye
tracking system because of the availability of commercial systems
(e.g., LC Technologies, Inc., Fairfax, Virginia), three commonly
used methods are briefly described here. In the first method, a
small device is attached to the eye, such as a special contact lens
with an embedded mirror or magnetic field sensor. This method seems
to provide the most accurate results, but is cumbersome to use. The
second method relies on the electro-oculogram (EOG) measured from
several specially arranged skin-surface electrodes. This method
suffers from the same problem as the first method and the results
seem to be unreliable. The third method utilizes natural or
infrared light reflected from the eye and sensed by a digital video
camera. Digital image processing techniques are then applied to
extract eye movements from changes in reflections. This method
appears to be promising in our case because it does not require any
attachment to the eye or the skin.

2.5  Pre-Processing  Algorithms 

We have investigated a content-based video coding algorithm using
adaptive bilateral pre-processing. When compared to alternative
algorithms which combine data processing and coding procedures,
such as the adaptive wavelet-MPEG coding [13], the bilateral
pre-processing approach is attractive in our application because it
provides freedom for the users to choose a video coding standard
after pre-processing. This allows an easy cascade of our
pre-processing module to any high-performance hardware encoding
engine. It also allows switching among a number of encoding engines
for different applications or networking environments. In the
hardware implementation of the new video coding method, we suggest
using a cascade of the pre-processing and encoding modules, each
running in a pipeline fashion. With optimized computational
algorithms and dedicated hardware, we expect to complete the
pre-processing task of each image within one frame interval.
Because the current encoding module usually requires longer
processing time than the pre-processing module, the delay caused by
pre-processing is expected to be small and the gain in transmission
speed due to reduced data size is expected to be much greater than
the time lag due to processing delay.

2.6  Mathematical  Formulation

The purpose of pre-processing is to remove the unnecessary contents
outside the region of interest where small details are
intentionally ignored by the surgeon. Our pre-processing will be
adaptive according to: 1) the gazing map constructed by
eye-tracking, and 2) the network condition fed back via the
transmission channel.

The bilateral filter is a nonlinear image filter which can smooth
image regions without blurring the edges. This is meaningful for
our application because edges are more easily perceived during
saccadic or quick voluntary eye movements. Therefore, strong edges
should be best preserved after pre-processing. To achieve this
goal, a bilateral filter is designed to perform a weighted
averaging in a neighborhood centered at a reference pixel. Higher
weights are assigned to the pixels that are closer in both space
and intensity to the reference pixel. Mathematically, given an
input image $I(x)$, the output image $J(x)$ is obtained by:

$$J(x) = \frac{\sum_{i=-S}^{S} \sum_{j=-S}^{S} I(x_1 + i,\, x_2 + j)\, w(x, \xi)}{\sum_{i=-S}^{S} \sum_{j=-S}^{S} w(x, \xi)} \qquad (2)$$

where $w(x, \xi) = c(x, \xi)\, s(x, \xi)$ is a kernel function,
$x = (x_1, x_2)$ and $\xi = (\xi_1, \xi_2)$ are, respectively,
space variables of the current and reference pixels, and
$I = (I_1, I_2, I_3)$ is the intensity value of a color pixel. Note
that the kernel function $w$ in the convolution is the product of
the functions $c$ and $s$, which represent the “closeness” in the
domain (space) and in the range (intensity), respectively. For
simplicity, we utilize Gaussian functions to form the kernel
function


$$w(x, \xi) = \exp\left(-\frac{\|\xi - x\|^2}{2\sigma_D^2}\right) \exp\left(-\frac{\|I(\xi) - I(x)\|^2}{2\sigma_R^2}\right) \qquad (3)$$

which is controlled by two parameters, $\sigma_D$ and $\sigma_R$.
The former, called geometric spread, is determined by the desired
amount of low-pass filtering. Given the acuity map, this amount is
defined by the acuity of the centered pixel in the neighborhood. A
larger $\sigma_D$ results in more blurring as more neighboring
pixels are involved in the averaging. The latter, called
photometric spread, is chosen to compensate for the blurring effect
on the pixels representing edges. Pixels that resemble the test
pixel in intensity have heavier weights in the averaging. The
smaller the $\sigma_R$, the more restrictive this constraint is. In
the case where the test pixel is on an edge, the discontinuity of
the intensity in the neighborhood confines the averaging to the
neighboring pixels that are also on the edge, therefore preventing
excessive blurring.


Figure 3: (a) An edge (intensity step 100) perturbed by Gaussian
noise with standard deviation equal to 10 in intensity. (b)
Combined similarity weights for a 23x23 neighborhood centered at a
pixel slightly (two pixels) to the right of the step. (c) The edge
in (a) after bilateral filtering with $\sigma_R = 50$ (in
intensity) and $\sigma_D = 5$ (in pixels) (courtesy of [22]).

An example shown in Fig. 3 explains this concept. Consider now a
sharp boundary between a dark and a bright region, as in panel (a).
When the bilateral filter is centered, say, at a pixel on the
bright side of the boundary, the similarity function $s$ assumes
values close to one for pixels on the same side, and values close
to zero for pixels on the dark side. The convolution mask is shown
in (b) for a 23x23 filter support centered at a pixel slightly away
(two pixels to the right) from the edge in (a). As a result, the
filter replaces the bright pixel at the center by an average of the
bright pixels in its vicinity, and essentially ignores the dark
pixels. Conversely, when the filter is centered at a dark pixel,
the bright pixels are ignored instead. Thus, as shown in (c), noise
was removed while the crisp edge was preserved.
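The weighted averaging described in this section can be sketched in a few lines of NumPy. The code below is an illustrative, unoptimized sketch rather than the implementation used in this work: it handles a single-channel (grayscale) image instead of color pixels, the name `bilateral_filter` and its parameters are ours, and border pixels are handled by edge replication, which is an assumption. The spreads may be scalars or per-pixel arrays of the image shape, in which case each output pixel uses the spreads stored at its own location.

```python
import numpy as np


def bilateral_filter(image, sigma_d, sigma_r, half_width):
    """Grayscale bilateral filter with Gaussian closeness and similarity.

    sigma_d: geometric spread (pixels), scalar or per-pixel array, > 0.
    sigma_r: photometric spread (intensity), scalar or per-pixel array, > 0.
    half_width: S, so the neighborhood is (2S + 1) x (2S + 1).
    """
    img = np.asarray(image, dtype=np.float64)
    rows, cols = img.shape
    padded = np.pad(img, half_width, mode="edge")  # assumed border handling
    numerator = np.zeros_like(img)
    denominator = np.zeros_like(img)
    for i in range(-half_width, half_width + 1):
        for j in range(-half_width, half_width + 1):
            # Neighbor at offset (i, j) for every center pixel at once.
            neighbor = padded[half_width + i:half_width + i + rows,
                              half_width + j:half_width + j + cols]
            closeness = np.exp(-(i * i + j * j) / (2.0 * sigma_d ** 2))
            similarity = np.exp(-((neighbor - img) ** 2) / (2.0 * sigma_r ** 2))
            weight = closeness * similarity
            numerator += weight * neighbor
            denominator += weight
    # The center pixel always contributes weight 1, so no division by zero.
    return numerator / denominator


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    step = np.zeros((40, 40))
    step[:, 20:] = 100.0                       # intensity step of 100
    noisy = step + rng.normal(0.0, 10.0, step.shape)
    smoothed = bilateral_filter(noisy, sigma_d=5.0, sigma_r=50.0, half_width=11)
    print(noisy[:, 30:].std(), smoothed[:, 30:].std())
```

The demonstration mirrors the setting of Fig. 3 (a step of 100 with noise of standard deviation 10, a 23x23 support): the noise on each side of the step is averaged out, while pixels across the step receive small similarity weights and contribute little.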

2.7  Adaptive  bilateral  filtering 

In video processing, the geometric spread $\sigma_D$ is utilized to
adapt to the acuity map. At any location closer to the gazing
point, a smaller spread $\sigma_D$ is utilized which produces less
smoothing effect and vice versa. The photometric spread $\sigma_R$
is adapted in a similar fashion in that a smaller $\sigma_R$ is
used at the pixels closer to the gaze point to maintain the
contrast ratio. To be specific, both spreads are functions of the
acuity map:
$\sigma_D(x) = \sigma_{DU} - (\sigma_{DU} - \sigma_{DL}) A(x)$ and
$\sigma_R(x) = \sigma_{RU} - (\sigma_{RU} - \sigma_{RL}) A(x)$,
where $\sigma_{DU}$ and $\sigma_{DL}$ are the pre-selected upper
and lower limits of the geometric spread and $\sigma_{RU}$ and
$\sigma_{RL}$ are those of the photometric spread, respectively.

Besides adapting to the acuity map, we also dynamically reconfigure
the bilateral filter according to the traffic condition of the
network. In practice, the lag time in the feedback information from
the transmission channel can serve as an indicator of the most
recent network status to which the output bitrate of the codec must
adapt. Clearly, this bitrate can be gracefully controlled by
properly reconfiguring the bilateral filter through the values of
$\sigma_D$ and $\sigma_R$. Since this adaptation is built on top of
the acuity map, we define the values of the lower and upper limits,
$\sigma_{DL}$/$\sigma_{RL}$ and $\sigma_{DU}$/$\sigma_{RU}$, such
that $\sigma_D$ and $\sigma_R$ are adjusted accordingly.
Specifically, if we denote the current network capacity by $N_c$,
the combined adaptation to both the network condition and the
acuity map will be
$\sigma_D(x) = \sigma_{DU}(N_c) - (\sigma_{DU}(N_c) - \sigma_{DL}(N_c)) A(x)$
and
$\sigma_R(x) = \sigma_{RU}(N_c) - (\sigma_{RU}(N_c) - \sigma_{RL}(N_c)) A(x)$,
where $\sigma_{\cdot}(N_c)$ denotes that the lower/upper limits are
functions of the network capacity.
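The two-level adaptation can be sketched as follows. The interpolation between the lower and upper limit according to the acuity is taken from the formulas above; how the limits themselves depend on the network capacity is not specified in the text, so `upper_limit_for_capacity` is a purely hypothetical linear rule, and the capacity is assumed to be normalized to the range 0 (fully congested) to 1 (uncongested).

```python
import numpy as np


def adaptive_spread(acuity, upper, lower):
    """sigma(x) = upper - (upper - lower) * A(x), for scalar or array A."""
    acuity = np.asarray(acuity, dtype=np.float64)
    return upper - (upper - lower) * acuity


def upper_limit_for_capacity(capacity, upper_when_free, upper_when_congested):
    """Hypothetical rule: raise the upper limit linearly as capacity drops."""
    c = min(max(capacity, 0.0), 1.0)  # clamp the normalized capacity
    return upper_when_congested - (upper_when_congested - upper_when_free) * c


if __name__ == "__main__":
    acuity = np.array([1.0, 0.5, 0.0])   # gaze point, mid periphery, far
    for capacity in (1.0, 0.2):
        upper = upper_limit_for_capacity(capacity, 5.0, 10.0)
        print(capacity, adaptive_spread(acuity, upper, lower=1.0))
```

With this rule the spread at the gaze point stays at the lower limit regardless of congestion, while the periphery is smoothed more strongly as the available capacity falls.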


Figure 4: ROI shape changes with respect to a constant update
interval T


3  Experimental  Results 


We have performed experiments to evaluate the video coding
performance using the described context-based approach.


Figure 5: Results of ROI tracking. LEFT: A circular ROI is centered
at the tip of an active surgical instrument (a suction tool).
RIGHT: The ROI after 152 ms when the suction tool moves to the
lower left. The shape of the ROI is progressively changed to allow
observation of both the previous and current locations along with
the trajectory of tool motion.

3.1  Gaze  Map  and  Its  Updates 

We assume that a gaze map is in the basic form of Eq. 1, reflecting
the visual attention of the operating surgeon. During surgery, the
gaze map must be updated constantly, not only by the detected point
of fixation, but also by the network condition as discussed
previously. The process of the gaze map updates must be
sufficiently smooth because abrupt updates distract the surgeon’s
attention and affect visual quality. We utilized a smooth updating
method which is graphically explained in Fig. 4. The ROI is updated
at equal time intervals T. We assume that, at t = 0, the ROI is
circular. Inside the ROI, the cross represents the current focal
point of attention. The dashed circle centered at the cross with
radius r is a predefined effective range to be used in ROI updates.
At t = 0, no update is necessary since the effective range is
entirely inside the circular ROI (first panel). At t = T, suppose
that the tip location changes and a portion of the effective range
moves outside of the existing ROI. Then, we define a new ROI as the
enclosed region defined by the existing ROI, the effective range,
and the two tangential lines (second panel). At t = 2T, the ROI is
similarly updated. However, we do not delete the previous update,
in favor of a gradual morphing of the shape (third panel). At
t = 3T, the effective range moves back to a position inside the
default ROI, and we remove the update made at t = T (fourth panel).
At t = 4T, the effective range is still within the default ROI, and
the update made at t = 2T is now removed. Fig. 5 shows the result
of the detected ROI with updating, where the video data was
obtained during a neurosurgery performed at the University of
Pittsburgh Medical Center. It can be observed that the updated ROIs
are cornerless and smooth. As the surgeon moves instruments within
the surgical field, the gradually changing ROI with the memory
effect just described allows a natural visualization of the
surgical landscape. At the same time, the nonessential details
outside the ROI are gracefully degraded.

3.2  Pre-Processing 

In order to evaluate the performance of the content-based video
coding technique, we utilized a video segment provided by Intuitive
Surgical, Inc. This video segment has a screen size of 720 x 576
pixels and contains 200 frames. One of the original video frames is
shown in the second panel of Fig. 6, where a scene of robotic
surgery using the da Vinci surgical system can be observed. We
utilized a popular MPEG-4 video codec (Windows Media Series 9) with
the quality parameter value set to 90 to compress the original
video frames. The measured average bitrate for the 200 frames was
2707.08 Kbps.


Regions                      Geometric Spread   Photometric Spread
                             (pixels)           (intensity levels)
ROI (A(x) = 1)               0                  0
Transit (0.5 <= A(x) < 1)    5                  7
Non-ROI (0 <= A(x) < 0.5)    10                 20

Table 1: Pre-defined Parameters for Regions of Different Importance

Because the use of an experimental eye tracker during surgery could
affect surgical outcomes, we utilized simulated eye gazing maps in
our experiments based on the theoretical model presented
previously. The top panel in Fig. 6 shows one of our simulations
where the fixation point of the eye is located at the center of the
screen. The circle indicates the ROI within which the quality of
the video is to be preserved. We implemented the adaptive bilateral
filtering algorithm, which adjusted the geometric spread parameter
to control the degree of smoothing according to the gaze map, and
the photometric spread parameter to allow more (less) smoothing in
the direction along (across) edges in the video image. In the
experiment, we predefined three sets of parameters for regions with
different importance levels, denoted as ROI, transit, and non-ROI,
respectively, as shown in Table 1. The pre-processed 200-frame
video segment was compressed, again using the Windows Media Series
9 video codec with the same quality parameter of 90. The average
bitrate for the pre-processed video segment was 1372.83 Kbps, a
reduction of about 49% compared to the un-processed segment
(2707.08 Kbps). Although the bitrate reduction is significant, the
qualities of the original and pre-processed images (the second and
third panels in Fig. 6) are visually indistinguishable when the
viewer fixates at the center of each image. The difference between
the original and filtered images is shown in the bottom of Fig. 6,
which has been scaled to values between 0 and 255 to facilitate
display. Notice that a considerable amount of texture was removed
from the original frame outside the ROI. It is rather surprising
that, without pre-processing, these redundant “data” would have
taken nearly half of the bandwidth!

Figure 6: (a) Gaze map representing the visual acuity of the
surgeon; (b) One of 200 original video frames (720x576 screen size)
during a robotic prostatectomy using the da Vinci surgical system.
The average bitrate is 2707.08 Kbps. (c) The same frame after
pre-processing by bilateral filter based on the gaze map. The
average bitrate is reduced by nearly 50% (to 1372.83 Kbps). (d)
Difference between (b) and (c). The error within the ROI is nearly
zero.
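The three-level parameter selection of Table 1 can be sketched as a lookup on the acuity value. This is an illustrative sketch, not the code used in the experiments: the function name `spreads_from_table` is hypothetical, and a spread of 0 is interpreted here as leaving the pixel unfiltered, which is our reading of the ROI row.

```python
import numpy as np


def spreads_from_table(acuity):
    """Map acuity values to the (geometric, photometric) spreads of Table 1."""
    acuity = np.asarray(acuity, dtype=np.float64)
    # np.select uses the first condition that holds for each element.
    conditions = [acuity >= 1.0, acuity >= 0.5]
    geometric = np.select(conditions, [0.0, 5.0], default=10.0)
    photometric = np.select(conditions, [0.0, 7.0], default=20.0)
    return geometric, photometric


if __name__ == "__main__":
    geometric, photometric = spreads_from_table([1.0, 0.75, 0.3])
    print(geometric)    # [ 0.  5. 10.]
    print(photometric)  # [ 0.  7. 20.]
```

A filter driven by these values needs a guard for the ROI case, since a Gaussian with zero spread is undefined; copying the input pixel unchanged wherever the geometric spread is 0 is one simple choice.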


4  Conclusion 

We have presented a content-based, special-purpose video coding
method to support low-latency video data transmission in a number
of military telemedical applications, especially in robotic
telesurgery. Our method detects visual attention by tracking the
eye movements of the remote surgeon. A gaze map is constructed in
which each value indicates the importance of the corresponding
pixel. Then, the original video frames are pre-processed based on
the gaze map. An adaptive bilateral filter has been developed which
removes unnecessary details in non-important regions of the video
screen according to both the visual attention information in the
gaze map and the network traffic condition. When the network
performance deteriorates, this adaptive filtering is conducted more
aggressively in the regions where the surgeon pays less attention
during surgical manipulation. Our experimental results show that,
using this attention-based video coding approach, a nearly 50%
reduction in data rate can be achieved while the quality of the
video data necessary to perform a telesurgery is nearly invariant.